docs(longhaul): add long-haul test design document by WentingWu666666 · Pull Request #400 · documentdb/documentdb-kubernetes-operator

WentingWu666666 · 2026-06-10T13:41:12Z

Part 1/5 of #348 split.

Scope

Adds docs/designs/long-haul-test-design.md (367 lines, new file).

Content

Goals and non-goals for the long-haul test
Architecture diagram (writer/verifier loop, operations scheduler, monitor, journal, report)
Data plane invariants: majority writes, gap detection, checksum validation
Failure modes and disruption window policy
HA-gated upgrade scenario, spec.instancesPerNode scaling
Relationship to test/e2e
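
As an illustration of the gap-detection and checksum-validation invariants listed above, here is a minimal sketch in Go. It is not the driver's code: the `Doc` shape, `NewDoc`, `Verify`, and the choice of SHA-256 are assumptions made for the example. It also assumes sequence numbers start at 1 and that `lastAcked` is the highest majority-acknowledged sequence number.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// Doc is a hypothetical document shape: a monotonically increasing
// sequence number, a payload, and a checksum stored at write time.
type Doc struct {
	Seq      int64
	Payload  string
	Checksum string
}

// checksum returns the hex-encoded SHA-256 of the payload.
func checksum(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// NewDoc builds a document the way a writer might before inserting it.
func NewDoc(seq int64, payload string) Doc {
	return Doc{Seq: seq, Payload: payload, Checksum: checksum(payload)}
}

// Verify scans documents up to lastAcked and returns the missing sequence
// numbers (gaps) and those whose stored checksum no longer matches the payload.
func Verify(docs []Doc, lastAcked int64) (gaps, corrupt []int64) {
	sorted := append([]Doc(nil), docs...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Seq < sorted[j].Seq })

	next := int64(1)
	for _, d := range sorted {
		if d.Seq > lastAcked {
			break // not yet acknowledged, so not subject to the invariant
		}
		for ; next < d.Seq; next++ {
			gaps = append(gaps, next)
		}
		if d.Checksum != checksum(d.Payload) {
			corrupt = append(corrupt, d.Seq)
		}
		next = d.Seq + 1
	}
	for ; next <= lastAcked; next++ {
		gaps = append(gaps, next)
	}
	return gaps, corrupt
}

func main() {
	docs := []Doc{NewDoc(1, "a"), NewDoc(2, "b"), NewDoc(4, "d"), NewDoc(5, "e")}
	docs[2].Payload = "tampered" // simulate corruption of seq 4

	gaps, corrupt := Verify(docs, 6)
	fmt.Println("gaps:", gaps, "corrupt:", corrupt)
}
```

In this sketch any acknowledged sequence number missing from the scan counts as a gap, whether it sits in the middle of the range or at its tail.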

Verification

Docs-only; no build/test impact.

1/5 (this PR): design doc
5/5 (test: extract test/shared module from e2e helpers for long-haul reuse #401): test/shared module extraction + e2e migration (can land in parallel)
2/5: long-haul driver code (depends on 5/5)
3/5: CI/CD workflows (depends on 2/5)
4/5: auto-upgrade + verification (depends on 2/5 + 3/5)

Copilot

Pull request overview

This PR adds (or refreshes) the long-haul (canary) test driver for the DocumentDB Kubernetes Operator: a standalone Go module under test/longhaul/ with writers/verifiers, disruption-window journaling, a weighted-random operations scheduler (scale + DocumentDB upgrade), health/leak monitoring, periodic reporting to a longhaul-report ConfigMap, plus in-cluster Deployment packaging and GitHub Actions workflows (build/deploy/monitor). It also updates the long-haul design document and supporting README/manifests.

Changes:

Introduces the long-haul test driver Go module (test/longhaul/) with workload, monitor, operations, journal, and reporting components.
Adds Kubernetes deployment artifacts (Deployment + RBAC + setup manifest) and CI workflows to build, deploy, and monitor the long-haul canary.
Updates docs/designs/long-haul-test-design.md to describe the architecture, invariants, and operations catalog.

Reviewed changes

Copilot reviewed 37 out of 38 changed files in this pull request and generated 24 comments.


| File | Description |
| --- | --- |
| `test/longhaul/workload/writer.go` | Writer loop that inserts majority-acknowledged documents with checksum/seq tracking. |
| `test/longhaul/workload/verifier.go` | Periodic verifier scanning for gaps and checksum mismatches under majority read concern. |
| `test/longhaul/workload/metrics.go` | Atomic counters + snapshot helpers for workload metrics. |
| `test/longhaul/report/suite_test.go` | Ginkgo suite bootstrap for `report` package tests. |
| `test/longhaul/report/report.go` | Markdown report generator for long-haul run state. |
| `test/longhaul/report/checkpoint.go` | Periodic reporter writing stdout + `longhaul-report` ConfigMap. |
| `test/longhaul/report/checkpoint_test.go` | Unit tests for ConfigMap create/update/result-field behavior. |
| `test/longhaul/report/alert.go` | GitHub Actions annotations for pass/fail/leak warnings. |
| `test/longhaul/README.md` | Driver usage, deployment instructions, and config reference. |
| `test/longhaul/operations/upgrade.go` | DocumentDB version upgrade operation using desired-version ConfigMap + steady-state gate. |
| `test/longhaul/operations/suite_test.go` | Ginkgo suite bootstrap for `operations` tests. |
| `test/longhaul/operations/scheduler.go` | Weighted-random scheduler with cooldown + steady-state gating + disruption windows. |
| `test/longhaul/operations/scheduler_test.go` | Unit tests for weighted selection + cooldown short-circuiting. |
| `test/longhaul/operations/scale.go` | Scale up/down operations with patch-confirmation polling and outage policies. |
| `test/longhaul/monitor/suite_test.go` | Ginkgo suite bootstrap for `monitor` tests. |
| `test/longhaul/monitor/leakdetect.go` | Linear-regression leak detector over sampled memory/CPU. |
| `test/longhaul/monitor/k8sclient.go` | Real Kubernetes client implementation (pods/CR/metrics, CR patching). |
| `test/longhaul/monitor/health.go` | Health monitor with steady-state tracking and recovery waits. |
| `test/longhaul/monitor/health_test.go` | Unit tests for steady-state and wait semantics using a fake ClusterClient. |
| `test/longhaul/journal/suite_test.go` | Ginkgo suite bootstrap for `journal` tests. |
| `test/longhaul/journal/policy.go` | Outage policy + disruption window evaluation logic. |
| `test/longhaul/journal/policy_test.go` | Unit tests pinning boundary behavior of the verdict oracle. |
| `test/longhaul/journal/journal.go` | Thread-safe append-only journal + disruption-window tracking. |
| `test/longhaul/journal/journal_test.go` | Unit tests for journal behavior and concurrency safety. |
| `test/longhaul/go.sum` | Dependency lockfile for the long-haul module. |
| `test/longhaul/go.mod` | New standalone Go module for the long-haul driver. |
| `test/longhaul/Dockerfile` | Multi-stage container build for the long-haul binary. |
| `test/longhaul/deploy/setup.yaml` | Namespace + DocumentDB CR bootstrap manifest for the canary cluster. |
| `test/longhaul/deploy/rbac.yaml` | ServiceAccount + Role/Bindings + metrics ClusterRole for the driver. |
| `test/longhaul/deploy/deployment.yaml` | ConfigMap-driven Deployment manifest for in-cluster execution. |
| `test/longhaul/config/suite_test.go` | Ginkgo suite bootstrap for config tests. |
| `test/longhaul/config/config.go` | Env-driven config loading + validation for the driver. |
| `test/longhaul/config/config_test.go` | Unit tests for env parsing, validation, and enable flag parsing. |
| `test/longhaul/cmd/longhaul/main.go` | Standalone binary wiring: Mongo workload, ops scheduler, monitoring, reporting. |
| `docs/designs/long-haul-test-design.md` | Updated design doc describing architecture, invariants, and phases. |
| `.github/workflows/longhaul-monitor.yaml` | Hourly monitor workflow for Deployment health/report staleness + version publishing. |
| `.github/workflows/longhaul-image-build.yml` | Workflow to build/push the long-haul driver image to GHCR. |
| `.github/workflows/longhaul-deploy.yml` | Workflow to roll the driver Deployment on AKS (manual + workflow_run). |
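
The `leakdetect.go` entry describes a linear-regression leak detector over sampled memory/CPU. A minimal sketch of that idea in Go follows, assuming ordinary least squares over (time, value) samples and a fixed slope threshold. `Slope`, `LeakSuspected`, and the threshold value are hypothetical and do not reflect the file's actual API.

```go
package main

import "fmt"

// Slope returns the least-squares slope of ys against xs.
// ok is false when there are fewer than two samples, the slices differ
// in length, or all xs are identical (the slope is undefined).
func Slope(xs, ys []float64) (slope float64, ok bool) {
	if len(xs) < 2 || len(xs) != len(ys) {
		return 0, false
	}
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	den := n*sumXX - sumX*sumX
	if den == 0 {
		return 0, false
	}
	return (n*sumXY - sumX*sumY) / den, true
}

// LeakSuspected flags a sustained upward trend: the fitted growth per
// unit of time exceeds maxGrowth.
func LeakSuspected(xs, ys []float64, maxGrowth float64) bool {
	s, ok := Slope(xs, ys)
	return ok && s > maxGrowth
}

func main() {
	hours := []float64{0, 1, 2, 3, 4}
	rssMiB := []float64{200, 212, 219, 231, 240} // hypothetical operator RSS samples

	s, _ := Slope(hours, rssMiB)
	fmt.Printf("growth: %.1f MiB/hour, leak suspected: %v\n", s, LeakSuspected(hours, rssMiB, 5))
}
```

Fitting a line rather than comparing the first and last sample makes the check less sensitive to a single noisy reading.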

documentdb-triage-tool · 2026-06-10T13:59:34Z

🤖 Auto-triaged by documentdb-triage-tool.

Applied: test, CI/CD, documentation, dependencies
Project fields suggested: Component test · Priority P3 · Effort XL · Status Needs Review
Confidence: 0.97 (mixed)

Reasoning

component from path globs (test, ci, docs, dependencies); effort from diff stats (5023+0 LOC, 38 files); LLM: Single-file docs-only update to a design document with no build or test impact, part of a larger split PR series.

If a label is wrong, remove it manually and ping @patty-chow so the rules can be tuned. The bot will not re-label items that already have component labels.

Adds the design doc covering goals, architecture (writer/verifier loop, operations scheduler, monitor, journal, report), data plane invariants (majority writes, gap detection, checksum validation), failure modes, and relationship to test/e2e. Split from documentdb#348 as a standalone reviewable PR. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@hossain-rayhan

…eate) Address @hossain-rayhan feedback on documentdb#400: the doc didn't make explicit that a Fatal failure preserves the cluster for post-mortem rather than auto-recreating it, and that recovery is manually triggered after a maintainer reviews the alert from the monitor. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

xgerman · 2026-06-12T18:14:17Z

+| **Journal** | In-process append-only event log shared by all components. | Reproducible event stream for the report. |
+| **Report** | Aggregates the journal into a markdown summary at a configurable interval; raises alerts on threshold breaches. | Markdown report; alert lines. |
+
+### Cluster Topology


we should mention that where possible we want to reuse the code from the e2e tests (e.g. client)

Added a "Code reuse" paragraph at the end of the Architecture section in ab9a009 (will update SHA in reply if push lands differently):

Where possible, the driver consumes the same helpers as the e2e suite — the Mongo client, DocumentDB lifecycle operations (create / patch / wait-healthy / delete), and TLS plumbing all live in a shared test/shared Go module. This keeps long-haul behavior aligned with what e2e exercises and avoids two diverging mongo-driver wrappers.

That test/shared module is what #401 extracts — long-haul will consume it from day one.

xgerman · 2026-06-12T18:15:41Z

+
+## Lifecycle
+
+The test runs **continuously** — no cycles, no resets. Workload, metrics, operations, and health monitoring all run as long-lived processes. The system accumulates real state (PVC growth, CR history, operator memory) exactly as it would in production.


Please elaborate on how we deal with different versions, e.g. are we always running the latest, current, etc.? When are we updating? Is that part of the test?

Is there a point when we start over? What's the criteria?

xgerman · 2026-06-12T18:16:54Z

+
+The test runs **continuously** — no cycles, no resets. Workload, metrics, operations, and health monitoring all run as long-lived processes. The system accumulates real state (PVC growth, CR history, operator memory) exactly as it would in production.
+
+**Workload runs through upgrades.** No drain, no quiesce. Draining before upgrade hides exactly the upgrade-under-state bugs we're testing.


Are we also downgrading? Or are we starting over at some point so we can test upgrade more than once?

xgerman · 2026-06-12T18:18:01Z

+| **Lifecycle** | DocumentDB version upgrade, operator upgrade |
+| **HA** | controlled failover |
+| **Chaos** | kill primary pod, drain node |
+| **Data protection** | trigger backup, verify backup |


Do we have operator upgrades as well?
Operator chaos?
Remote nodes/multi-region? (maybe not now but potentially planned in the future)

xgerman · 2026-06-12T18:19:35Z

+- One disruptive op at a time. Overlapping disruptions are non-diagnosable.
+- Per-category cooldown between ops. Lets the cluster stabilize.
+- Steady-state gate — health check must pass before the next op fires.
+- Backup isolation — no topology changes during backup.


why not? Backup should block/delay - this should be handled by backup

Agreed — dropped the bullet (d1694f5). Backup-vs-topology is the backup feature's job; isolating it in the harness would hide exactly the bugs we want to catch.

xgerman · 2026-06-12T18:22:06Z

+**Per-component attribution.** Metrics are tagged by component (operator pod RSS, DB pod RSS, goroutine count, reconcile rate, API-call rate). Without separate series, a memory climb at hour 30 is undiagnosable.
+
+**Human-in-the-loop alerts.** The hourly monitor posts a summary to the workflow run and, when configured, to a chat channel. A maintainer reviews the evidence and manually creates a GitHub issue. No auto-filed issues — alert fatigue from transient or infrastructure failures would erode trust in the canary.
+


We should also record the system dashboard metrics (latency, uptime, etc.) as well as the logs of all components for later analysis. Where do we keep them?

Added an Artifact Retention subsection in 314f3dc. Two tiers:

Rolling status — longhaul-report ConfigMap polled by the monitor workflow (this part is already used in the existing driver).

Forensics bundle — pod logs, events, CR snapshots, metric samples, journal — uploaded as a GitHub Actions artifact on every Tier-1 / Tier-2 alert and at end of run.

Operational details (which collectors, sanitization rules, bundle layout) are kept in test/longhaul/README.md rather than the design doc, so the design stays high-level.

xgerman · 2026-06-12T18:24:44Z

+| **CloudNative-PG** | Failover via pod delete + SIGSTOP; pod-level resource sampling | Ginkgo framework (we use a long-lived `Deployment` instead) |
+| **CockroachDB** | Chaos runner; separate workload from disruption; roachstress | Custom roachtest framework (too heavy) |
+| **Vitess** | Background stress goroutine; per-query tracking | No fault injection (we need disruptive ops) |
+


We are also interested in how FoundationDB tests (they turned their approach into Antithesis); not sure if they cover long haul though.

Added FoundationDB and Antithesis as separate rows in the Learnings table (309aefc). Short answer to your question: neither covers long-haul — both run in simulated time on fake network/disk, so they catch rare-interleaving logic bugs in seconds but can't surface the wall-clock accumulation bugs (mem leaks, lock-table bloat, CR-history drift) that need real reconciliation cycles over real days. We adopt their property-based oracle and workload/fault separation.

xgerman · 2026-06-12T18:25:02Z

+
+## Open Questions
+
+1. Multi-region canary scope — AKS Fleet integration?


yes, at a later point

Agreed — renamed Open Questions -> Future Scope and reworded the multi-region item so it reads as explicitly deferred (still a candidate before GA if scope allows). See 8a79edd.

@xgerman

Per @xgerman feedback, call out that both Primary and Baseline run with production-style podAntiAffinity and a PodDisruptionBudget so chaos and upgrade operations exercise operator/DB bugs rather than misconfiguration failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xgerman

Per @xgerman feedback, state explicitly that the driver reuses e2e helpers (Mongo client, DocumentDB lifecycle ops, TLS plumbing) from the shared test/shared Go module rather than forking them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xgerman

Per @xgerman, multi-region (AKS Fleet) is deferred. Rename the Open Questions section to Future Scope and reword the item so the deferred status is explicit; no open design questions remain. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xgerman

Per @xgerman, call out FDB Simulation and Antithesis. Both are deterministic simulation tools that target rare-interleaving logic bugs in simulated time; they don't cover the wall-clock accumulation bugs that long-haul exists for. We adopt their property-based oracle and workload/fault separation, not the simulation engine itself. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xgerman

Per @xgerman, spell out where evidence is kept. Two tiers: a rolling status summary in the longhaul-report ConfigMap (already used by the monitor workflow), and a forensics bundle uploaded as a GitHub Actions artifact on alert and at end of run. Operational details (collectors, sanitization, layout) belong in test/longhaul/README.md, not the design. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

@xgerman

Per @xgerman, isolating backup from topology hides exactly the serialization bugs long-haul should catch. Backup-vs-topology is the backup feature's responsibility, not the harness's. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 10, 2026 13:41

WentingWu666666 requested review from alaye-ms, hossain-rayhan and xgerman as code owners June 10, 2026 13:41

Copilot started reviewing on behalf of WentingWu666666 June 10, 2026 13:41

WentingWu666666 mentioned this pull request Jun 10, 2026

test: extract test/shared module from e2e helpers for long-haul reuse #401

Open

Copilot AI reviewed Jun 10, 2026


documentdb-triage-tool Bot added the CI/CD, dependencies, documentation, and test labels Jun 10, 2026

WentingWu666666 closed this Jun 10, 2026

WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from 15eb6f4 to ff2c1cb Compare June 10, 2026 14:00

WentingWu666666 reopened this Jun 10, 2026

WentingWu666666 changed the title ~~docs(longhaul): update long-haul test design document (1/5 of #348)~~ docs(longhaul): add long-haul test design document (1/5 of #348) Jun 10, 2026

WentingWu666666 marked this pull request as draft June 10, 2026 14:30

WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch 13 times, most recently from 031d0a5 to f848d41 Compare June 10, 2026 15:14

WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from f848d41 to fc797a3 Compare June 10, 2026 15:14

WentingWu666666 force-pushed the developer/wentingwu/longhaul-design-doc branch from fc797a3 to bdc8cf0 Compare June 10, 2026 15:18

WentingWu666666 changed the title ~~docs(longhaul): add long-haul test design document (1/5 of #348)~~ docs(longhaul): add long-haul test design document Jun 10, 2026

WentingWu666666 marked this pull request as ready for review June 10, 2026 15:20

hossain-rayhan reviewed Jun 10, 2026


Comment thread docs/designs/long-haul-test-design.md

hossain-rayhan approved these changes Jun 10, 2026


xgerman reviewed Jun 12, 2026


Copilot AI added 6 commits June 15, 2026 12:53




5 participants
